Abstract

Dropout is a popular technique for regularizing artificial neural networks. Dropout 
networks are generally trained by minibatch gradient descent with a dropout mask 
turning off some of the units—a different pattern of dropout is applied to every 
sample in the minibatch. We explore a very simple alternative to the dropout mask. 
Instead of masking dropped out units by setting them to zero, we perform matrix 
multiplication using a submatrix of the weight matrix—unneeded hidden units are 
never calculated. Performing dropout batchwise, so that one pattern of dropout is 
used for each sample in a minibatch, we can substantially reduce training times. 
Batchwise dropout can be used with fully-connected and convolutional neural
networks.


1 Independent versus batchwise dropout 

Dropout is a technique to regularize artificial neural networks—it prevents overfitting 
la. A fully connected network with two hidden layers of 80 units each can learn to 
classify the MNIST training set perfectly in about 20 training epochs—unfortunately 
the test error is quite high, about 2%. Increasing the number of hidden units by a factor 
of 10 and using dropout results in a lower test error, about 1.1%. The dropout network 
takes longer to train in two senses; each training epoch takes several times longer, 
and the number of training epochs needed increases too. We consider a technique for 
speeding up training with dropout—it can substantially reduce the time needed per 
epoch. 

Consider a very simple $\ell$-layer fully connected neural network with dropout. To
train it with a minibatch of $b$ samples, the forward pass is described by the equations:

$$x_{k+1} = [x_k \cdot d_k] \times W_k, \qquad k = 0, \dots, \ell - 1.$$

Here $x_k$ is a $b \times n_k$ matrix of input/hidden/output units, $d_k$ is a $b \times n_k$
dropout-mask matrix of independent Bernoulli($1 - p_k$) random variables, $p_k$ denotes the
probability of dropping out units in level $k$, and $W_k$ is an $n_k \times n_{k+1}$ matrix of
weights connecting level $k$ with level $k + 1$. We are using $\cdot$ for (Hadamard)
element-wise multiplication and $\times$ for matrix multiplication. We have omitted the
non-linear functions (e.g. the rectifier function for the hidden units, and softmax for the
output units) but for the introduction we will keep the network as simple as possible.

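The network can be trained using the backpropagation algorithm to calculate the
gradients of a cost function (e.g. negative log-likelihood) with respect to the $W_k$:

$$\frac{\partial\,\mathrm{cost}}{\partial W_k} = [x_k \cdot d_k]^{\mathsf T} \times \frac{\partial\,\mathrm{cost}}{\partial x_{k+1}}, \qquad
\frac{\partial\,\mathrm{cost}}{\partial x_k} = \left( \frac{\partial\,\mathrm{cost}}{\partial x_{k+1}} \times W_k^{\mathsf T} \right) \cdot d_k.$$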



With dropout training, we are trying to minimize the cost function averaged over an 
ensemble of closely related networks. However, networks typically contain thousands 
of hidden units, so the size of the ensemble is much larger than the number of training 
samples that can possibly be ‘seen’ during training. This suggests that the
independence of the rows of the dropout mask matrices $d_k$ might not be terribly important;
the success of dropout simply cannot depend on exploring a large fraction of the available
dropout masks. Some machine learning libraries such as Pylearn2 allow dropout to
be applied batchwise instead of independently.¹ This is done by replacing $d_k$ with a
$1 \times n_k$ row matrix of independent Bernoulli($1 - p_k$) random variables, and then
copying it vertically $b$ times to get the right shape.
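
The two masking schemes can be sketched in a few lines of NumPy. This is only an illustration: the function names, the seed and the sizes are assumptions made for the example, and the code is not taken from Pylearn2 or from our software.

```python
import numpy as np

def independent_mask(rng, b, n, p):
    """b x n mask: every sample gets its own Bernoulli(1 - p) pattern."""
    return (rng.random((b, n)) >= p).astype(np.float32)

def batchwise_mask(rng, b, n, p):
    """Draw one 1 x n Bernoulli(1 - p) row and copy it vertically b times."""
    row = (rng.random((1, n)) >= p).astype(np.float32)
    return np.repeat(row, b, axis=0)

rng = np.random.default_rng(0)          # seed chosen arbitrarily
d_ind = independent_mask(rng, b=100, n=1000, p=0.5)
d_bat = batchwise_mask(rng, b=100, n=1000, p=0.5)
```

Both masks have the same shape, so either can be used in the forward-pass equation; only the batchwise one has identical rows.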

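To be practical, it is important that each training minibatch can be processed quickly.
A crude way of estimating the processing time is to count the number of floating point
multiplication operations needed (naively) to evaluate the $\times$ matrix multiplications
specified above:

$$\sum_{k=0}^{\ell-1} \Big( \underbrace{b \times n_k \times n_{k+1}}_{\text{forwards}}
+ \underbrace{n_k \times b \times n_{k+1}}_{\partial\mathrm{cost}/\partial W}
+ \underbrace{b \times n_{k+1} \times n_k}_{\text{backwards}} \Big).$$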

However, when we take into account the effect of the dropout mask, we see that many
of these multiplications are unnecessary. The $(i, j)$-th element of the $W_k$ weight matrix
effectively ‘drops-out’ of the calculations if unit $i$ is dropped in level $k$, or if unit $j$ is
dropped in level $k + 1$. Applying 50% dropout in levels $k$ and $k + 1$ renders 75% of
the multiplications unnecessary.
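
As a rough worked example of this counting argument, the sketch below tallies the naive multiplications for one minibatch and the fraction of a weight matrix that survives dropout. The layer sizes (a 784-1000-1000-1000-10 network) and the function names are assumptions chosen for illustration.

```python
def naive_multiplications(b, sizes):
    """Multiplications for the forwards, dcost/dW and backwards products."""
    per_pass = sum(b * n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))
    return 3 * per_pass

def surviving_fraction(p_in, p_out):
    """Fraction of W_k's entries that connect two units that are both kept."""
    return (1 - p_in) * (1 - p_out)

total = naive_multiplications(b=100, sizes=[784, 1000, 1000, 1000, 10])
unnecessary = 1 - surviving_fraction(0.5, 0.5)
```

With 50% dropout on both sides of a weight matrix, `unnecessary` evaluates to 0.75, matching the figure quoted above.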

If we apply dropout independently, then the parts of $W_k$ that disappear are
different for each sample. This makes it effectively impossible to take advantage of the
redundancy: it is slower to check if a multiplication is necessary than to just do the
multiplication. However, if we apply dropout batchwise, then it becomes easy to take
advantage of the redundancy. We can literally drop-out redundant parts of the
calculations.

¹Pylearn2: see function apply_dropout in mlp.py

Figure 1: Left: MNIST training time for three layer networks (log scales) on an
NVIDIA GeForce GTX 780 graphics card. Right: Percentage reduction in training
times moving from no dropout to batchwise dropout. The time saving for the 500N
network with minibatches of size 100 increases from 33% to 42% if you instead
compare batchwise dropout with independent dropout.


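The binary $1 \times n_k$ batchwise dropout matrices $d_k$ naturally define submatrices of
the weight and hidden-unit matrices. Let $x_k^{\mathrm{dropout}} := x_k[:, d_k]$ denote the
submatrix of $x_k$ consisting of the level-$k$ hidden units that survive dropout. Let
$W_k^{\mathrm{dropout}} := W_k[d_k, d_{k+1}]$ denote the submatrix of $W_k$ consisting of weights
that connect active units in level $k$ to active units in level $k + 1$. The network can then
be trained using the equations:

$$x_{k+1}^{\mathrm{dropout}} = x_k^{\mathrm{dropout}} \times W_k^{\mathrm{dropout}}$$

$$\frac{\partial\,\mathrm{cost}}{\partial W_k^{\mathrm{dropout}}} = \left(x_k^{\mathrm{dropout}}\right)^{\mathsf T} \times \frac{\partial\,\mathrm{cost}}{\partial x_{k+1}^{\mathrm{dropout}}}$$

$$\frac{\partial\,\mathrm{cost}}{\partial x_k^{\mathrm{dropout}}} = \frac{\partial\,\mathrm{cost}}{\partial x_{k+1}^{\mathrm{dropout}}} \times \left(W_k^{\mathrm{dropout}}\right)^{\mathsf T}$$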
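
The equivalence between masking and taking submatrices can be checked numerically. The following NumPy sketch is a toy illustration for a single layer, not our CUDA implementation; the sizes, the seed and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n0, n1 = 4, 6, 8                      # toy sizes (assumed)
x0 = rng.standard_normal((b, n0))
W0 = rng.standard_normal((n0, n1))
d0 = rng.random(n0) >= 0.5               # boolean keep-masks, one per level
d1 = rng.random(n1) >= 0.5
g1 = rng.standard_normal((b, n1)) * d1   # dcost/dx_1, zero on dropped units

# Masked version: full-size products, dropped units multiplied by zero.
x1_masked = ((x0 * d0) @ W0) * d1
gW_masked = (x0 * d0).T @ g1
gx_masked = (g1 @ W0.T) * d0

# Submatrix version: dropped units are never computed.
x0_drop = x0[:, d0]
W0_drop = W0[np.ix_(d0, d1)]
g1_drop = g1[:, d1]
x1_drop = x0_drop @ W0_drop
gW_drop = x0_drop.T @ g1_drop
gx_drop = g1_drop @ W0_drop.T
```

Restricted to the surviving units, the masked and submatrix results agree, while the submatrix products involve only the entries of `W0` that connect two surviving units.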


The redundant multiplications have been eliminated. There is an additional benefit
in terms of memory needed to store the hidden units; $x_k^{\mathrm{dropout}}$ needs less space
than $x_k$. In Section 2 we look at the performance improvement that can be achieved using
CUDA/CUBLAS code running on a GPU. Roughly speaking, processing a minibatch
with 50% batchwise dropout takes as long as training a 50% smaller network on the
same data. This explains the nearly overlapping pairs of lines in Figure 1.

We should emphasize that batchwise dropout only improves performance during
training; during testing the full $W_k$ matrix is used as normal, scaled by a factor of
$1 - p_k$. However, machine learning research is often constrained by long training times
and high costs of equipment. In Section 3 we show that all other things being equal,
batchwise dropout is similar to independent dropout, but faster. Moreover, with the
increase in speed, all other things do not have to be equal. With the same resources,
batchwise dropout can be used to

• increase the number of training epochs, 

• increase the number of hidden units, 

• increase the number of validation runs used to optimize “hyper-parameters”, or 

• to train a number of independent copies of the network to form a committee. 

These possibilities will often be useful as ways of improving generalization/reducing 
test error. 

In Section 4 we look at batchwise dropout for convolutional networks. Dropout
for convolutional networks is more complicated as weights are shared across spatial
locations. A minibatch passing up through a convolutional network might be
represented at an intermediate hidden layer by an array of size 100 × 32 × 12 × 12: 100
samples, the output of 32 convolutional filters, at each of 12 × 12 spatial locations. It
is conventional to use a dropout mask with shape 100 × 32 × 12 × 12; we will call this
independent dropout. In contrast, if we want to apply batchwise dropout efficiently by
adapting the submatrix trick, then we will effectively be using a dropout mask with
shape 1 × 32 × 1 × 1. This looks like a significant change: we are modifying the
ensemble over which the average cost is optimized. During training, the error rates are
higher. However, testing the networks gives very similar error rates.

1.1 Fast dropout 

We might have called batchwise dropout fast dropout but that name is already taken 
CD. Fast dropout is a very different approach to solving the problem of training large
neural networks quickly without overfitting. We discuss some of the differences of the
two techniques in the appendix. 

2 Implementation 

In theory, for $n \times n$ matrices, addition is an $O(n^2)$ operation, and multiplication is
$O(n^{2.37\ldots})$ by the Coppersmith-Winograd algorithm. This suggests that the bulk of
our processing time should be spent doing matrix multiplication, and that a
performance improvement of about 60% should be possible compared to networks using
independent dropout, or no dropout at all. In practice, SGEMM functions use Strassen’s
algorithm or naive matrix multiplication, so performance improvement of up to 75%
should be possible.

We implemented batchwise dropout for fully-connected and convolutional neural
networks using CUDA/CUBLAS.² We found that using the highly optimized cublasSgemm
function to do the bulk of the work, with CUDA kernels used to form the
submatrices and to update the $W_k$ using $\partial\,\mathrm{cost}/\partial W_k^{\mathrm{dropout}}$,
worked well. Better performance may well be obtained by writing a SGEMM-like matrix
multiplication function that understands submatrices.

²Software available at http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/graham/

For large networks and minibatches, we found that batchwise dropout was
substantially faster, see Figure 1. The approximate overlap of some of the lines on the left
indicates that 50% batchwise dropout reduces the training time in a similar manner to
halving the number of hidden units.

The graph on the right shows the time saving obtained by using submatrices to
implement dropout. Note that for consistency with the left hand side, the graph compares
batchwise dropout with dropout-free networks, not with networks using independent
dropout. The need to implement dropout masks for independent dropout means that
Figure 1 slightly undersells the performance benefits of batchwise dropout as an
alternative to independent dropout.

For smaller networks, the performance improvement is lower; bandwidth issues
result in the GPU being underutilized. If you were implementing batchwise dropout
for CPUs, you would expect to see greater performance gains for smaller networks as
CPUs have a lower processing-power to bandwidth ratio.


2.1 Efficiency tweaks 

If you have $n = 2000$ hidden units and you drop out $p = 50\%$ of them, then the
number of dropped units is approximately $np = 1000$, but with some small variation as
you are really dealing with a Binomial($n$, $p$) random variable; its standard deviation is
$\sqrt{np(1-p)} = 22.4$. The sizes of the submatrices $W_k^{\mathrm{dropout}}$ and
$x_k^{\mathrm{dropout}}$ are therefore slightly random. In the interests of efficiency and
simplicity, it is convenient to remove this randomness. An alternative to dropping each
unit independently with probability $p$ is to drop a subset of exactly $np$ of the hidden
units, uniformly at random from the set of all such subsets. It is still the case that each
unit is dropped out with probability $p$. However, within a hidden layer we no longer have
strict independence regarding which units are dropped out. The probability of dropping
out the first two hidden units changes very slightly, from

$$p^2 = 0.25 \quad \text{to} \quad \frac{np}{n} \cdot \frac{np - 1}{n - 1} = 0.24987\ldots$$
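
The numbers above, and the fixed-size sampling itself, can be reproduced with a short NumPy sketch. The helper name `fixed_size_keep_indices` and the seed are assumptions made for illustration; our CUDA code is organized differently.

```python
import math
import numpy as np

n, p = 2000, 0.5
sd = math.sqrt(n * p * (1 - p))          # Binomial(n, p) standard deviation

def fixed_size_keep_indices(rng, n, p):
    """Drop exactly round(n * p) units; return sorted indices of the kept ones."""
    n_drop = int(round(n * p))
    dropped = rng.choice(n, size=n_drop, replace=False)
    keep = np.ones(n, dtype=bool)
    keep[dropped] = False
    return np.flatnonzero(keep)

k = int(n * p)                           # number of dropped units
p_first_two_independent = p ** 2
p_first_two_fixed = (k / n) * ((k - 1) / (n - 1))

kept = fixed_size_keep_indices(np.random.default_rng(0), n, p)
```

Because the number of kept units is now deterministic, every minibatch uses submatrices of the same shape, which simplifies memory allocation.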


Also, we used a modified form of NAG-momentum minibatch gradient descent E).
After each minibatch, we only updated the elements of $W_k^{\mathrm{dropout}}$, not all the
elements of $W_k$. With $v_k$ and $v_k^{\mathrm{dropout}}$ denoting the momentum
matrix/submatrix corresponding to $W_k$ and $W_k^{\mathrm{dropout}}$, our update was


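$$v_k^{\mathrm{dropout}} \leftarrow \mu\, v_k^{\mathrm{dropout}} - \varepsilon (1 - \mu)\, \frac{\partial\,\mathrm{cost}}{\partial W_k^{\mathrm{dropout}}}, \qquad
W_k^{\mathrm{dropout}} \leftarrow W_k^{\mathrm{dropout}} + v_k^{\mathrm{dropout}}.$$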


The momentum still functions as an autoregressive process, smoothing out the
gradients; we are just reducing the rate of decay $\mu$ by a factor of $(1 - p_k)(1 - p_{k+1})$.
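
To illustrate what updating only a submatrix means in practice, here is a minimal NumPy sketch. It assumes a plain momentum step with coefficient `lr * (1 - mu)`; the learning rate, the momentum value, the index sets and the function name are all assumptions for the example and are not the settings used in our experiments.

```python
import numpy as np

def submatrix_momentum_step(W, v, grad_drop, rows, cols, lr=0.01, mu=0.9):
    """Update only W[rows, cols] and v[rows, cols], in place.

    The coefficient lr * (1 - mu) and the default values are assumptions.
    """
    ix = np.ix_(rows, cols)
    v[ix] = mu * v[ix] - lr * (1 - mu) * grad_drop
    W[ix] += v[ix]

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
v = np.zeros_like(W)
W_before = W.copy()
rows = np.array([0, 2, 5])               # units kept in level k
cols = np.array([1, 3, 4, 7])            # units kept in level k + 1
grad_drop = np.ones((rows.size, cols.size))
submatrix_momentum_step(W, v, grad_drop, rows, cols)
```

Entries of `W` and `v` outside the selected rows and columns are left untouched, so each momentum entry decays only on the minibatches in which both of its units are active.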
Figure 2: Dropout networks trained using a restricted number of dropout patterns
(each × is from an independent experiment). The blue line marks the test error for a
network with half as many hidden units trained without dropout. Horizontal axis:
number of dropout patterns used.

3 Results for fully-connected networks 

The fact that batchwise dropout takes less time per training epoch would count for 
nothing if a much larger number of epochs was needed to train the network, or if a 
large number of validation runs were needed to optimize the training process. We 
have carried out a number of simple experiments to compare independent and batchwise
dropout. In many cases we could have produced better results by increasing the training
time, annealing the learning rate, using validation to adjust the learning process, etc.
We chose not to do this as the primary motivation for batchwise dropout is efficiency,
and excessive use of fine-tuning is not efficient.

For datasets, we used: 

• The MNIST³ set of 28 × 28 pixel handwritten digits.

• The CIFAR-10 dataset of 32x32 pixel color pictures (PI). 

• An artificial dataset designed to be easy to overfit.

Following (U, for MNIST and CIFAR-10 we trained networks with 20% dropout in 
the input layer, and 50% dropout in the hidden layers. For the artificial dataset we 
increased the input-layer dropout to 50% as this reduced the test error. In some cases, 
we have used relatively small networks so that we would have time to train a number 
of independent copies of the networks. This was useful in order to see if the apparent 
differences between batchwise and independent dropout are significant or just noise.

³http://yann.lecun.com/exdb/mnist/

3.1 MNIST


Our first experiment explores the effect of dramatically restricting the number of dropout
patterns seen during training. Consider a network with three hidden layers of size 1000,
trained for 1000 epochs using minibatches of size 100. The number of distinct dropout
patterns, $2^{3784}$, is so large that we can assume that we will never generate the same
dropout mask twice. During independent dropout training we will see 60 million
different dropout patterns; during batchwise dropout training we will see 100 times fewer
dropout patterns.

For both types of dropout, we trained 12 independent networks for 1000 epochs, 
with batches of size 100. For batchwise dropout we got a mean test error of 1.04% 
[range (0.92%, 1.1%), s.d. 0.057%] and for independent dropout we got a mean test
error of 1.03% [range (0.98%, 1.08%), s.d. 0.033%]. The difference in the mean test
errors is not statistically significant. 

To explore further the reduction in the number of dropout patterns seen, we changed
our code for (pseudo)randomly generating batchwise dropout patterns to restrict the
number of distinct dropout patterns used. We modified it to have period $n$ minibatches,
with $n = 1, 2, 4, 8, \ldots$; see Figure 2. For $n = 1$ this corresponds to only ever
using one dropout mask, so that 50% of the network’s 3000 hidden units are never
actually trained (and 20% of the 784 input features are ignored). During training this
corresponds to training a dropout-free network with half as many hidden units; the test
error for such a network is marked by a blue line in Figure 2. The error during testing
is higher than the blue line because the untrained weights add noise to the network.

If $n$ is less than thirteen, it is likely that some of the network’s 3000 hidden units
are dropped out every time and so receive no training. If $n$ is in the range thirteen to
fifty, then it is likely that every hidden unit receives some training, but some pairs of
hidden units in adjacent layers will not get the chance to interact during training, so
the corresponding connection weight is untrained. As the number of dropout masks
increases into the hundreds, we see that it is quickly a case of diminishing returns.
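
The thresholds of roughly thirteen and fifty can be motivated by a back-of-the-envelope calculation. The sketch below assumes that every unit is dropped independently with probability 0.5 in each pattern (ignoring the fixed-size tweak of Section 2.1) and counts only the two 1000 × 1000 hidden-to-hidden weight matrices; the function names are illustrative.

```python
def prob_some_unit_never_trained(n_patterns, n_units=3000, p=0.5):
    """P(at least one hidden unit is dropped in every one of the patterns)."""
    always_dropped = p ** n_patterns
    return 1 - (1 - always_dropped) ** n_units

def expected_untrained_pairs(n_patterns, n_pairs=2_000_000, p=0.5):
    """Expected number of weights whose two end units are never active together."""
    both_kept = (1 - p) ** 2
    return n_pairs * (1 - both_kept) ** n_patterns
```

Under these assumptions, the probability that some hidden unit is never trained falls below one half between 12 and 13 patterns, and the expected number of never-trained hidden-to-hidden weights falls below one between 50 and 51 patterns.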

3.2 Artificial dataset 

To test the effect of changing network size, we created an artificial dataset. It has 100
classes, each containing 1000 training samples and 100 test samples. Each class is
defined using an independent random walk of length 1000 in the discrete cube $\{0,1\}^{1000}$.
For each class we generated the random walk, and then used it to produce the training
and test samples by randomly picking points along the length of the walk (giving binary
sequences of length 1000) and then randomly flipping 40% of the bits. We trained three
layer networks with $n \in \{250, 500, 1000, 2000\}$ hidden units per layer with
minibatches of size 100. See Figure 3.
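
One way such a dataset could be generated for a single class is sketched below. Several details are assumptions not fixed by the description above: the walk starts at a uniformly random point, each step flips one uniformly chosen coordinate, and each bit of a sample is flipped independently with probability 0.4.

```python
import numpy as np

def random_walk(rng, dim=1000, length=1000):
    """Walk on {0,1}^dim; each step flips one uniformly chosen coordinate (assumed)."""
    walk = np.empty((length, dim), dtype=np.uint8)
    walk[0] = rng.integers(0, 2, size=dim)
    for t in range(1, length):
        walk[t] = walk[t - 1]
        walk[t, rng.integers(dim)] ^= 1
    return walk

def sample_class(rng, walk, n_samples, flip_prob=0.4):
    """Pick random points on the walk, then flip each bit with probability flip_prob."""
    points = walk[rng.integers(len(walk), size=n_samples)]
    flips = (rng.random(points.shape) < flip_prob).astype(np.uint8)
    return points ^ flips

rng = np.random.default_rng(0)           # seed chosen arbitrarily
walk = random_walk(rng)
train = sample_class(rng, walk, n_samples=1000)
test = sample_class(rng, walk, n_samples=100)
```

Repeating this with 100 independent walks would give the 100 classes; the heavy bit-flipping noise is what makes the data easy to overfit.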

Looking at the training error against training epochs, independent dropout seems to 
learn slightly faster. However, looking at the test errors over time, there does not seem
to be much difference between the two forms of dropout. Note that the x-axis is the 
number of training epochs, not the training time. The batchwise dropout networks are 
learning much faster in terms of real time. 
Figure 3: Artificial dataset. 100 classes each corresponding to noisy observations of a
one dimensional manifold in $\{0,1\}^{1000}$.

3.3 CIFAR-10 fully-connected 

Learning CIFAR-10 using a fully connected network is rather difficult. We trained
three layer networks with $n \in \{125, 250, 500, 1000, 2000\}$ hidden units per layer with
minibatches of size 1000. We augmented the training data with horizontal flips. See
Figure 4.

4 Convolutional networks 

Dropout for convolutional networks is more complicated as weights are shared across
spatial locations. Suppose layer $k$ has spatial size $s_k \times s_k$ with $n_k$ features per
spatial location, and the $k$-th operation is a convolution with $f \times f$ filters. For a
minibatch of size $b$, the convolution involves arrays with sizes:

layer $k$: $b \times n_k \times s_k \times s_k$
weights $W_k$: $n_{k+1} \times n_k \times f \times f$

Dropout is normally applied using dropout masks with the same size as the layers. We
will call this independent dropout: independent decisions are made at every spatial
location. In contrast, we define batchwise dropout to mean using a dropout mask with
shape $1 \times n_k \times 1 \times 1$. Each minibatch, each convolutional filter is either on or
off across all spatial locations.

These two forms of regularization seem to be doing quite different things.
Consider a filter that detects the color red, and a picture with a red truck in it. If dropout is
applied independently, then by the law of averages the message “red” will be
transmitted with very high probability, but with some loss of spatial information. In contrast,
with batchwise dropout there is a 50% chance we delete the entire filter output.
Experimentally, the only substantial difference we could detect was that batchwise dropout
resulted in larger errors during training.

Figure 4: Results for CIFAR-10 using fully-connected networks of different sizes.

To implement batchwise dropout efficiently, notice that the $1 \times n_k \times 1 \times 1$
dropout masks correspond to forming subarrays $W_k^{\mathrm{dropout}}$ of the weight arrays
$W_k$ with size

$$(1 - p_{k+1})\, n_{k+1} \times (1 - p_k)\, n_k \times f \times f.$$

The forward-pass is then simply a regular convolutional operation using
$W_k^{\mathrm{dropout}}$; that makes it possible, for example, to take advantage of the highly
optimized cudnnConvolutionForward function from the NVIDIA cuDNN package.
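
The subarray construction can be illustrated with a small NumPy stand-in for the convolution. This is a sketch under stated assumptions: `conv2d` is a naive helper written for this example (it is not the cuDNN API), and the sizes and seed are arbitrary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, W):
    """Naive 'valid' cross-correlation. x: (b, n_in, s, s); W: (n_out, n_in, f, f)."""
    f = W.shape[-1]
    windows = sliding_window_view(x, (f, f), axis=(2, 3))  # (b, n_in, s', s', f, f)
    return np.einsum("bcijuv,ocuv->boij", windows, W)

rng = np.random.default_rng(0)
b, n_in, n_out, s, f = 2, 6, 8, 5, 3
x = rng.standard_normal((b, n_in, s, s))
W = rng.standard_normal((n_out, n_in, f, f))
keep_in = np.flatnonzero(rng.random(n_in) >= 0.5)    # filters kept in layer k
keep_out = np.flatnonzero(rng.random(n_out) >= 0.5)  # filters kept in layer k + 1

# Masked version: a 1 x n_in x 1 x 1 mask broadcast over batch and space.
mask_in = np.zeros((1, n_in, 1, 1))
mask_in[0, keep_in] = 1.0
y_masked = conv2d(x * mask_in, W)[:, keep_out]

# Subarray version: convolve only the kept input and output filters.
W_drop = W[np.ix_(keep_out, keep_in)]                # (kept_out, kept_in, f, f)
y_drop = conv2d(x[:, keep_in], W_drop)
```

The two results coincide on the kept output filters, so an optimized convolution routine can be applied directly to the smaller arrays.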

4.1 MNIST 

For MNIST, we trained a LeNet-5 type CNN with two layers of 5 × 5 filters, two
layers of 2 × 2 max-pooling, and a fully connected layer [6]. There are three places for
applying 50% dropout:

$$32C5\text{-}MP2 \xrightarrow{50\%} 64C5\text{-}MP2 \xrightarrow{50\%} 512N \xrightarrow{50\%} 10N.$$

The test errors for the two dropout methods are similar, see Figure 5.

4.2 CIFAR-10 with varying dropout intensity 

For a first experiment with CIFAR-10 we used a small convolutional network with 
small filters. The network is a scaled down version of the network from m; there are 
four places to apply dropout: 

128C3 - MP2 - 256C2 - MP2 - 384C2 - MP2 - 512N - 10N.

Figure 5: MNIST test errors, training repeated three times for both dropout methods.

The input layer is 24 x 24. We trained the network for 1000 epochs using randomly 
chosen subsets of the training images, and reflected each image horizontally with
probability one half. For testing we used the centers of the images.

In Figure 6 we show the effect of varying the dropout probability $p$. The training
errors are increasing with $p$, and the training errors are higher for batchwise dropout.
The test-error curves both seem to have local minima around $p = 0.2$. The batchwise
test error curve seems to be shifted slightly to the left of the independent one,
suggesting that for any given value of $p$, batchwise dropout is a slightly stronger form of
regularization.

4.3 CIFAR-10 with many convolutional layers 

We trained a deep convolutional network on CIFAR-10 without data augmentation. 
Using the notation of [2], our network has the form

$$(64nC2 - FMP\sqrt[3]{2})_{12} - 832C2 - 896C1 - \text{output},$$

i.e. it consists of 12 2 × 2 convolutions with 64n filters in the n-th layer, 12 layers
of max-pooling, followed by two fully connected layers; the network has 12.6 million
parameters. We used an increasing amount of dropout per layer, rising linearly from 0%
dropout after the third layer to 50% dropout after the 14th. Even though the amount
of dropout used in the middle layers is small, batchwise dropout took less than half as
long per epoch as independent dropout; this is because applying small amounts of
independent dropout in large hidden-layers creates a bandwidth performance-bottleneck.

As the network’s max-pooling operation is stochastic, the test errors can be reduced 
by repetition. Batchwise dropout resulted in an average test error of 7.70% (down to
5.78% with 12-fold testing). Independent dropout resulted in an average test error of 
7.63% (reduced to 5.67% with 12-fold testing). 
Figure 6: CIFAR-10 results using a convolutional network with dropout probability
$p \in (0, 0.4)$. Batchwise dropout produces a slightly lower minimum test error.


5 Conclusions and future work 

We have implemented an efficient form of batchwise dropout. All other things being 
equal, it seems to learn at roughly the same speed as independent dropout, but each 
epoch is faster. Given a fixed computational budget, it will often allow you to train
better networks. 

There are other potential uses for batchwise dropout that we have not explored yet: 

• Restricted Boltzmann Machines can be trained by contrastive divergence [3] with
dropout 10. Batchwise dropout could be used to increase the speed of training. 

• When a fully connected network sits on top of a convolutional network,
training the top and bottom of the network can be separated over different
computational nodes JS). The fully connected top-parts of the network typically contain
95% of the parameters; keeping the nodes synchronized is difficult due to the
large size of the matrices. With batchwise dropout, nodes could communicate
$\partial\,\mathrm{cost}/\partial W_k^{\mathrm{dropout}}$ instead of $\partial\,\mathrm{cost}/\partial W_k$, and so reduce the bandwidth needed.

• Using independent dropout with recurrent neural networks can be too disruptive 
to allow effective learning; one solution is to only apply dropout to some parts 
of the network ca. Batchwise dropout may provide a less damaging form of 
dropout, as each unit will either be on or off for the whole time period. 

• Dropout is normally only used during training. It is generally more accurate
to use the whole network for testing purposes; this is equivalent to averaging over
the ensemble of dropout patterns. However, in a “real-time” setting, such as
analyzing successive frames from a video camera, it may be more efficient to
use dropout during testing, and then to average the output of the network over
time.

• Nested dropout Cl is a variant of regular dropout that extends some of the
properties of PCA to deep networks. Batchwise nested dropout is particularly easy
to implement as the submatrices are regular enough to qualify as matrices in the 
context of the SGEMM function (using the LDA argument). 

• DropConnect is an alternative form of regularization to dropout nni. Instead of 
dropping hidden units, individual elements of the weight matrix are dropped out. 
Using a modification similar to the one in Section 2.1, there are opportunities for
speeding up DropConnect training by approximately a factor of two.
A Fast dropout


We might have called batchwise dropout fast dropout but that name is already taken 
im. Fast dropout is an alternative form of regularization that uses a probabilistic 
modeling technique to imitate the effect of dropout; each hidden unit is replaced with 
a Gaussian probability distribution. The ‘fast’ relates to reducing the number of training
epochs needed compared to regular dropout (with reference to results in a preprint⁴
of 111). Training a network 784-800-800-10 on the MNIST dataset with 20% input 
dropout and 50% hidden-layer dropout, fast dropout converges to a test error of 1.29% 
after 100 epochs of L-BFGS. This appears to be substantially better than the test error 
obtained in the preprint after 100 epochs of regular dropout training. 

However, this is a dangerous comparison to make. The authors of m used a 
learning-rate scheme designed to produce optimal accuracy eventually, not after just 
one hundred epochs. We tried using batchwise dropout with minibatches of size 100 
and an annealed learning rate of We trained a network with two 

hidden layers of 800 rectified linear units each. Training for 100 epochs resulted in a 
test error of 1.22% (s.d. 0.03%). After 200 epochs the test error has reduced further 
to 1.12% (s.d. 0.04%). Moreover, per epoch, batchwise-dropout is faster than regular 
dropout while fast-dropout is slower. Assuming we can make comparisons across
different programs,⁵ the 200 epochs of batchwise dropout training take less time than the
100 epochs of fast dropout training.


⁴http://arxiv.org/abs/1207.0580

⁵Using our software to implement the network, each batchwise dropout training epoch takes 0.67 times
as long as independent dropout. In ED a figure of 1.5 is given for the ratio between fast- and independent-
dropout when using minibatch SGD; when using L-BFGS to train fast-dropout networks the training time per
epoch will presumably be even more than 1.5 times longer, as L-BFGS uses line-searches requiring additional
forward passes through the neural network.